Currently, the housing crisis is one of the most prominent societal challenges in the Netherlands. The Dutch housing market is both very competitive and hard to access because of a supply shortage, which also leads to long waiting lists for social housing. Transaction prices are rising sharply: CBS (Centraal Bureau voor de Statistiek, 2021) reported that prices in the first quarter of 2021 were 11.3% higher than a year earlier, which is well above the average increase in Europe. Many prospective buyers have to overbid on listings in order to secure a place to live. A major problem is that the size of the mortgage is determined by the appraisal value of the house, so many buyers have to put their own capital into the purchase. This makes it very challenging for new entrants to the market (such as first-time buyers and young professionals) to rent or buy a house or apartment, especially as many are still paying off large amounts of student debt. It appears that only 3% of first-time buyers are financially able to buy a home without getting themselves into serious financial trouble.
We considered several housing sites for our project, of which Huizenzoeker.nl appeared to be the most suitable. This website offers a clear view of the Dutch housing market with a wide range of listings and displays an extensive amount of information (per listing and per neighbourhood). Funda.nl is currently the largest housing platform in the Netherlands; however, the site is not useful for this project, because Funda has protected its data to guard against competitor sites. Zoekallehuizen.nl offers a large range of listings too, but could not provide us with important information needed to research the housing crisis, e.g. overbidding percentages. Remax.nl is also a large housing website, yet it mainly focuses on houses in other countries, like Spain and Belgium. As we are determined to analyse the Dutch housing market because of the severity of the current crisis, Remax.nl did not meet our needs.
The Huizenzoeker data used is available at Huizenzoeker.nl.
This is a repository for the course Online Data Collection and Management at Tilburg University as part of the Master’s program ‘Marketing Analytics’, used for the team project of group 3.
Members of our team:
Our data source ‘Huizenzoeker.nl’ fits well into the data aggregator and low scale/scope category of Figure W3.1: Data Source Exploration of the ‘Fields of Gold’ paper (add a reference). The data available on Huizenzoeker.nl is less detailed, but it combines data from multiple platforms (multi-platform data), and it has regional rather than global coverage: it only presents data on the housing market for the municipalities in the Netherlands, so for the most part it is only useful for those living or planning to live in the Netherlands. As for content type, we can put this site into the e-commerce type (??, not sure what fits best).
add the table from the Fields of Gold paper here
Although Huizenzoeker.nl is less well known (and has fewer users) than other housing sites such as Funda, even at the regional level, it offers the richer data and novel measures we need to answer our research questions. Instead of having to gather the data from numerous primary data providers in the housing sector, we could collect multi-platform data far more efficiently through this data aggregator.
As the seriousness of the housing crisis and the shortage of listings differ across the country, we aimed to create a dataset that represents the current housing market for each municipality in every province of the Netherlands. It clarifies which places are hit hardest by the crisis and which the least. With this dataset, we may help first-time buyers and young professionals in their search to buy or rent a house by showing them where they would have the highest chances. Furthermore, it provides them with insights into recent price developments of listings in a certain area, which helps them in negotiations about the purchase price. The dataset thus gives consumers data in addition to the information offered by their broker (e.g. direct information from the Kadaster). There were already some datasets available on the Dutch housing market; however, these did not specifically focus on the overbidding aspect of the current crisis, which forms an essential part of our research. Besides that, instead of only focusing on certain parts of the Netherlands, we preferred to cover all municipalities in the Netherlands to get a more complete picture of the current state of the housing crisis. By focusing on municipalities, the units we analyse are small enough to examine the Dutch housing market locally (as opposed to only focusing on provinces), yet large enough to keep our dataset manageable (as opposed to focusing on every house that is for sale in the Netherlands).
Huizenzoeker.nl is an independent platform which is not influenced or moderated by estate agents, as it aims to inform its clients in an honest manner with reliable information. It is an aggregator site which collects information from different public sources, such as JAAP.nl. However, as Huizenzoeker.nl is owned by Spotzi, a big data visualization specialist that focuses on visualizing and analysing spatial data, the Huizenzoeker team also provides much data itself. From Spotzi, they retrieve data on, for example, the value of the listings and the development of housing prices. This is beneficial to our scraping project, as it results in long lists of information for each listing, municipality, province, etc. The site therefore does not only provide specifics on the houses themselves, like every other site, but also on the neighbourhood, the mean income in the municipality, the distance to the closest supermarket, etc. The platform states that it is a partner of JAAP.nl and Huislijn.nl; however, it does not have explicit consent from JAAP.nl to show all information that is displayed on JAAP.nl (which seems quite contradictory). In turn, databanks like JAAP.nl get their data from other sites, such as Funda.nl.
The dataset is funded by advertisers on the site. Advertisers can target visitors on the site through the filter options, which allow targeting based on different home characteristics or on region and price range. As Spotzi created various profiles from the data they compiled, e.g. starters (young and ambitious), families with children (nest builder), kids away from home (thriving fifties), advertisers can target a very specific audience within a certain zip code. These profiles can also be used on external sites through the Rearch extension function. Maybe we can still add a little more information here
The instances that comprise the dataset represent all municipalities of the Netherlands for every price; this is one type of instance. The entities summarize the data of the housing market (for every house) in that municipality, so the values represent the averages per municipality (step 3 of the navigation path below). The instances are connected to each other by the province that they are in; all municipalities therefore also belong to a larger type of instance, the provinces (step 2 of the navigation path below).
The instances that comprise the dataset represent houses (maybe more specific, so all houses or houses recently sold or currently available?). However, in our dataset housing data is grouped at municipality level, where values represent the average number per municipality (step 3 of navigation path below). In turn, all municipalities belong to a larger type of instance, the provinces (step 2 of navigation path below).
The following screenshots represent a brief navigation path:
I will try to make a gif that zooms in on the url! (Lesley)
The goal of this project has been to scrape information per municipality and per province (for completeness). Therefore, pages like the ones displayed under step 2 and 3 have been utilized to obtain statistical housing-related measures per municipality and province.
If every municipality is seen as an instance, we would say there are 352 municipalities in total, which are spread over the 12 provinces of the Netherlands.
here we call provinces instances again, while I think it should be municipality = instance
To clarify, each province, or instance, has its own page, but is also the parent of several municipality pages (as described in question 2.2). The structure of the province and municipality pages is almost identical. For this question, the illustration is based on one of the municipality pages, Tilburg (Noord-Brabant).
First, all pages display a map of the Netherlands and their specific location on the map. Next, all contain a link to all the houses that are for sale, followed by a link to all houses that are for rent. Furthermore, a subsequent link directs to the most expensive houses of the municipality/province in question.
Next, each page displays 4 ‘trend’ statistics. Each of the 4 numbers comes with a percentage reflecting the change in the statistic compared to the month before. The first trend refers to the average selling price of a house within the municipality/province. The second trend refers to the number of houses sold in the past month. The third trend reflects the average selling price per square metre. The fourth trend indicates the average overbidding percentage within the municipality/province in question. These trends will be of high importance during our project.
Moreover, all pages cover histograms that show price and housing supply trends. Additionally, a link is included to access more information about the housing market in question.
Next, a section is shown in which several questions are answered in unprocessed text. The first questions cover exactly the same information as the 4 trend statistics. The last ones, however, cover the population number and the population growth/decline compared to the year before. This population-related information will, again, be of high relevance later in our project.
Furthermore, a pie chart showing the average age distribution in the province/municipality is included. Moreover, a statistic on average disposable income is included, which again will be important later in our project.
Finally, at the bottom of the page random houses that are for sale/rent are displayed, followed by links that navigate to a ‘child’-page (e.g. from province page to municipality page).
From each province in the Netherlands, we intend to scrape all corresponding municipalities. For the provinces, an associated URL is for example ‘https://www.huizenzoeker.nl/woningmarkt/noord-brabant/’, which changes to ‘https://www.huizenzoeker.nl/woningmarkt/noord-brabant/tilburg/’ for Tilburg. So, each instance that we want to scrape corresponds to its own URL.
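As an illustration of this URL structure, the sketch below builds a province or municipality URL from its parts. The helper name `build_url` is hypothetical and not taken from our notebook, and it assumes that every page follows the same lowercase slug pattern as the two example URLs above, which are the only ones confirmed here.

```python
from urllib.parse import urljoin

# Base of the housing-market section, taken from the example URLs above.
BASE_URL = "https://www.huizenzoeker.nl/woningmarkt/"


def build_url(province, municipality=None):
    """Return the page URL for a province, or for a municipality inside it.

    Assumption: slugs are lowercase and hyphenated, as in the two examples.
    """
    path = f"{province}/"
    if municipality is not None:
        path += f"{municipality}/"
    return urljoin(BASE_URL, path)


print(build_url("noord-brabant"))
print(build_url("noord-brabant", "tilburg"))
```

Looping such a helper over a list of province and municipality slugs would give one URL per instance to scrape.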
Moreover, within the code we wrote, we extracted the municipality or province name for each of these URLs, by scraping the title and removing the word ‘Woningmarkt’ from it. Therefore, we changed the official label to an artificial one for clarity purposes, e.g. now the municipality Tilburg can be identified through the label ‘Tilburg’, instead of its URL.
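A minimal sketch of this labelling step is shown below. It works on a title string that has already been extracted from the page; the function name `label_from_title` and the sample title are hypothetical, and the exact title format on the site may differ.

```python
import re


def label_from_title(title):
    """Remove the word 'Woningmarkt' from a page title and tidy the whitespace."""
    without_keyword = re.sub(r"\bWoningmarkt\b", "", title)
    # Collapse any leftover double spaces and trim the ends.
    return " ".join(without_keyword.split())


# Hypothetical title, used only to illustrate the idea.
print(label_from_title("Woningmarkt Tilburg"))
```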
*I don’t think they mean using different code blocks for scraping different variables, but more like validating the outcomes and splitting the data into a validation/holdout sample and a calibration sample (e.g. one to estimate and one to validate the results; like in IRM) ??
The data that is present on every province and municipality page has a similar structure. Due to this structure we were able to split the data into multiple variables. This is recommended as it allows for quicker comparing and interpreting of certain statistics among provinces and among municipalities.
The first variable we split off is the average selling price of a house for every page. Additionally, the percentage difference of this average price compared to the last month is extracted. These numbers allow for comparing in which municipalities the most expensive/cheapest houses are located on average, and in which municipalities the growth in price is the steepest/slowest.
Furthermore, the next variable is the number of houses sold within a month. The percentage difference in this number compared to the month before has been added as well. This data is relevant as it indicates which municipalities are most popular among the population of the Netherlands, and which municipalities are becoming more popular, as shown by a large growth in house sales. In the future, it might be relevant to expand this project to find out why these regions sell the most houses. Is the price the lowest? Are the overall home features superior?
Next up, we split the average price per square metre for every municipality into a variable. Again, a number indicating the percentage difference compared to last month is included. These numbers are important, as they measure price relative to size. It can be difficult to compare value for money by just looking at the absolute average selling price of a house: certain houses are larger and are thus sold for a higher price. The average selling price per square metre controls for this issue.
The next split represents the percentage by which buyers overbid per municipality on average. Again, the percentage difference of this number compared to the month before is included. This number is important as it indicates in which regions buyers are willing to pay the highest ‘extra’ amount of money for a house. This might indicate where the competition for a house is the highest on average, or how badly buyers want to secure the house.
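To make the two kinds of percentages in this section concrete, the sketch below shows one common way to compute an overbidding percentage and a month-over-month change. This is an assumption for illustration only: Huizenzoeker.nl publishes these figures ready-made, and we do not know the exact formula the site uses. Both function names are hypothetical.

```python
def overbid_percentage(asking_price, selling_price):
    """Percentage by which the selling price exceeds the asking price.

    Assumed definition; the site may compute its figure differently.
    """
    if asking_price <= 0:
        raise ValueError("asking_price must be positive")
    return 100 * (selling_price - asking_price) / asking_price


def month_over_month_change(current, previous):
    """Percentage change of a statistic compared to the month before."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return 100 * (current - previous) / previous


# A house listed at 400,000 and sold for 430,000 was overbid by 7.5%.
print(overbid_percentage(400_000, 430_000))
```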
Furthermore, we scraped and split the average disposable income per municipality into a variable as well. The average disposable income is an important measure as it tells us how much inhabitants of a certain municipality are able to spend on a house. Thus, to match supply to demand, this variable might have an impact on the level of housing prices.
Lastly, several measures of the number of inhabitants have been split off: first, the number of inhabitants; second, the percentage of inhabitant growth over the past year, if applicable; and third, if population growth is not applicable, the percentage of inhabitant decline.
The paragraphs above clarify what all the variables entail, which ones belong together, and why they are important to include in our dataset. Within the Jupyter script, however, all the above-mentioned variables have been included in one table. We chose to display everything together to make all the information quickly accessible. We think it is no problem to handle all the variables in the code this way, for the following reason. Huizenzoeker.nl updates its content each month automatically. The structure and the URLs stay exactly the same, yet the statistical numbers change per month. We designed our code so that it captures every number that is present at that moment, regardless of whether we run it in, for example, September or October.
Huizenzoeker.nl states that, for over 10 years, they have made every effort possible to ensure that this website functions properly and is kept permanently accessible for reputational reasons. Huizenzoeker.nl edits the information offered on its site with the greatest possible care and devotes the same care to the composition of the site. However, it legally cannot guarantee the correctness and completeness of the data shown, as imperfections may occur. Moreover, Huizenzoeker is able to adapt the website where and whenever they please. No restrictions hold. This information has been retrieved from the disclaimer section on the official Huizenzoeker website.
Possibly for their own use. However, no official archival versions of the complete datasets are available to us as the public of Huizenzoeker.nl. Huizenzoeker.nl displays real-time data rather than archival data for the figures we scrape to answer our research objective (there is, for instance, data on the ‘prijsontwikkelingen’ over the last couple of years, which means data for the previous years must exist). The data we scrape from the municipality pages is updated every month, so when scraping these pages you do not get direct access to the figures or averages for the previous months.
The external resources include JAAP.nl and Huislijn.nl, which in turn extract data from other sites such as Funda.nl. These sites are all available for free, so no restrictions are present in the form of licenses and fees for future users. There is a premium part of Huizenzoeker.nl (under ‘Abonnementen’) for which you do need to pay. For our project, the premium information was irrelevant.
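The growth/decline split described above can be sketched as follows. The function name and the column names `growth_pct` and `decline_pct` are hypothetical, and the sketch assumes the yearly change is already available as a signed number, which is a simplification of the text found on the page.

```python
def split_population_change(change_pct):
    """Map a signed yearly change onto separate growth and decline columns.

    The column that does not apply is filled with 'NA'.
    """
    if change_pct is None:
        return {"growth_pct": "NA", "decline_pct": "NA"}
    if change_pct >= 0:
        return {"growth_pct": change_pct, "decline_pct": "NA"}
    return {"growth_pct": "NA", "decline_pct": abs(change_pct)}


print(split_population_change(1.2))
print(split_population_change(-0.4))
```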
No, the data is not confidential. Therefore users do not have any rights to remove listings from the Huizenzoeker site. Only if their house is no longer for sale/rent on JAAP.nl, their listing will be removed. However, information on the house itself such as its value, year of construction, property size, will remain available. This information is considered as public.
No, the data can in no way be perceived as offensive, insulting, or threatening.
Not applicable.
Within our dataset we only scrape the number of inhabitants per municipality/province and the average disposable income per municipality/province. Therefore, one subpopulation, defined by different levels of average disposable income, can be considered to be present.
LOOK FOR DISTRIBUTIONS FOR THIS SUBPOPULATION !!!!!
I don’t think we have subpopulations, as we scrape houses and not people, so I don’t think we need to identify subpopulations; we scrape data for every person able to buy a house and don’t target only starters who are buying a house, or only very rich people buying villas
Not applicable.
We scraped the data using Python in Jupyter Notebooks. By loading the packages BeautifulSoup, Selenium, requests, re, pandas, time, webdriver manager, and json, we were able to use the functions needed for our specific web scraping steps.
Huizenzoeker.nl does not provide an official software API (anymore), so we scraped the data by writing code ourselves.
Technically, we have taken the entire population, and not a sample, for our project: we took all the municipality pages as input, and not a portion of them.
Yet, logically, we have taken a sample. In logical terms, a single unit would be a single house. However, as the statistics we were after were only available as averages on the municipality pages, we took the municipality pages as single units. A municipality page consists of average numbers over all the single houses present in that region. That is the sampling strategy applied.
In the data collection process solely the team members of this project were involved.
Huizenzoeker.nl covers the housing market data of October 2021. This is the most recent housing market data. Huizenzoeker.nl shows this most-recent data because the housing market changes every month (e.g., houses are sold, new houses are offered, the asking price may be more extremely outbid in one month than in the other month, etc.).
Not applicable.
Not applicable.
Not applicable.
Not applicable.
Not applicable.
Not applicable.
Not applicable.
First of all, all the values of the variables have been cleaned so that they only give a numeric value or percentage as output (no additional words, and only consistent punctuation). This means removing the HTML tags, stripping out unnecessary characters, and retaining relevant substrings only. To achieve this, we made use of regular expressions (regex) to pre-process the textual data. When no numeric value exists for a specific municipality, ‘NA’ is recorded as the output for the variable in question. Furthermore, all the variables have been assigned a clear label, such that the numeric values are given a meaning. For example, we labelled the provinces, the cities, and all variables. Additionally, all the variables have been displayed in a table against all the municipalities/provinces as a small start in preprocessing.
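The cleaning step described above can be sketched roughly as follows. This is not the exact code from our notebook: the function name `clean_number` and the sample strings are hypothetical, and the sketch assumes Dutch number formatting (a dot as thousands separator and a comma as decimal separator).

```python
import re


def clean_number(raw):
    """Strip HTML tags and words from a scraped value and return a float.

    Returns the string 'NA' when no numeric value is present.
    Assumes Dutch formatting: '.' for thousands, ',' for decimals.
    """
    text = re.sub(r"<[^>]+>", "", raw or "")  # drop HTML tags
    match = re.search(r"-?\d[\d.]*(?:,\d+)?", text)  # first number in the text
    if match is None:
        return "NA"
    number = match.group(0).replace(".", "").replace(",", ".")
    return float(number)


# Hypothetical raw values, used only to illustrate the idea.
print(clean_number("<span>€ 412.500</span>"))
print(clean_number("<b>+3,2%</b> t.o.v. vorige maand"))
print(clean_number("geen data"))
```

Returning ‘NA’ instead of raising an error keeps one row per municipality in the final table, even when a page lacks a statistic.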
Add more info on this by looking at Fields of Gold paper: step 4: data extraction
Yes, the raw output is saved in a JSON file automatically, as part of our coding script. The JSON file can be created by running our Scraping Woningmarkt (Final Code) Jupyter script. Add more information here from the storage/deployment section of Fields of Gold under step 4
We decided to use self-developed code that interfaces with high-level scraping libraries (e.g. Selenium and BeautifulSoup) as a software tool for data extraction. We chose not to use a ready-made scraping toolkit like Mozenda, or packages that only require some coding like Scrapy for Python, because self-developed code seemed the most suitable for the complexity of our data collection. Developing the code ourselves in Python required quite some time and effort; however, in the end it is a better way to actively manage data quality and reproducibility than the other methods. After preprocessing, cleaning, and labelling the data in Python, we exported the dataset to RStudio, where we transformed the dataset into one ready for analysis. Python and RStudio, and the libraries Selenium and BeautifulSoup, are all publicly available. Provide link? So together with our code, you can replicate our scraping efforts easily.
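A minimal sketch of saving scraped records to a JSON file and reading them back is given below. The field names, the values, and the file name are hypothetical placeholders and do not reflect the actual output of our script.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Write a list of scraped records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back, e.g. for cleaning or export to another tool."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical record with placeholder values, only to show the shape.
example = [{"province": "Noord-Brabant", "municipality": "Tilburg", "overbid_pct": "NA"}]

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "woningmarkt_raw.json")
    save_records(example, target)
    print(load_records(target))
```

Keeping the raw output in a separate file means the cleaning steps can be rerun later without scraping the site again.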
Maybe add more info on how we did what in the Rfile, not sure
We used our dataset in RStudio to create some plots and figures of the data we collected. We did this to give insights into how we would compare the municipalities for each province, and the data between the provinces.
There are not many (if any) papers or systems that use this dataset, so there is no real repository of them. On GitHub, we found some repositories:
* A simple Python wrapper for the Huizenzoeker API (last updated in October 2013): https://github.com/bpeschier/huizenzoeker
* Using the Jaap API to look for rentals in Rotterdam: https://github.com/thomasvt1/HuizenZoeker
So, in sum, these are not that interesting.
Adjust this section still
Broadly speaking, a suitable task for this dataset is helping (future) inhabitants of the Netherlands find their ideal home. With our data, a person could find the best municipality to live in for their specific circumstances (e.g. a specific disposable income level), find a region where the value for money is high, prepare for negotiations on the price, find out what the norm is in terms of overbidding for each municipality, and more.
The only harms that could arise in the case of Huizenzoeker are financial harms, e.g. when one is misinformed about housing prices due to our dataset.
However, as long as Huizenzoeker does not undergo drastic changes, no undesirable harms will arise. (*Also, if there were drastic changes on Huizenzoeker.nl, our code would likely no longer work, so in that case our scraper would not cause undesirable harms either; it simply would not run.)
Future users of the dataset could decide to implement more variables, or kick certain variables out. As long as this is done following the same steps as in our coding script, no harm can be done.
The dataset can be used for any matters regarding the housing market in the Netherlands, at municipality level as well as province level. For anything outside of this topic, the dataset has no use.